A data set is nothing more than a series of rows and columns that contain the responses to a survey.
In working through our research questions, we’ll constantly be going back and forth between the actual data (to see the pattern of responses) and the documentation, to figure out the actual question asked as well as how the different responses are coded.
| Hypothetical Survey |
|---|
| Question 1 What neighborhood do you live in? |
| 0-Neighborhood A |
| 1-Neighborhood B |
| 2-Other Neighborhood (please indicate) |
| -9-Don’t Know / Refused |
| … |
| Question 3 What is your income? |
| $__________ (annual number) |
| -9-Don’t Know / Refused |
Some cells of the table above have a negative number. Frequently, negative numbers are used to indicate what are called “missing values”. A missing value is a response like “don’t know”, “refused to answer”, or “did not answer”. Before we start doing calculations with our data, we’ll want to change these negative numbers to true missing values (usually symbolized by a “.” or an “NA”) so that they don’t goof up our calculations.
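To see why this matters, here is a minimal Python sketch. The income figures are made up for illustration, and the helper name `recode_missing` is hypothetical. It also assumes that every negative number in the column is a missing-value code, which you should confirm in the documentation before recoding (in some data sets a negative income is a real answer).

```python
from statistics import mean

# Made-up responses to Question 3 (annual income).
# -9 is the code for "Don't Know / Refused" in the hypothetical survey.
raw_incomes = [42000, -9, 58000, 31000, -9]

def recode_missing(values):
    """Replace negative codes with None (Python's stand-in for "NA")."""
    return [None if v < 0 else v for v in values]

incomes = recode_missing(raw_incomes)

# Keep only the real answers before calculating anything.
valid = [v for v in incomes if v is not None]

naive_mean = mean(raw_incomes)  # treats -9 as if it were an income
clean_mean = mean(valid)        # uses only the three real answers

print(incomes)
print(naive_mean, clean_mean)
```

The average computed from the raw column is pulled far below the average of the real answers, because each -9 is counted as though someone reported an income of negative nine dollars.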
Often in a spreadsheet, you’ll see the full text of a question written out (e.g. “What neighborhood do you live in?”). Most programs that work with data are going to want abbreviations (e.g. “Q1” or “neighborhood”) for the questions. These abbreviations should usually have no spaces and be 8 characters or fewer.
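The naming guideline can be checked with a few lines of Python. This is only a sketch: the function name `is_short_name` and the abbreviation “nbhd” are hypothetical, and the check covers just the two rules stated above (no spaces, at most 8 characters). Individual programs may impose other rules of their own.

```python
def is_short_name(name, max_len=8):
    """Return True if name has no whitespace and is at most max_len characters."""
    has_space = any(ch.isspace() for ch in name)
    return 0 < len(name) <= max_len and not has_space

# Candidate abbreviations for Question 1.
candidates = ["Q1", "nbhd", "neighborhood", "Q 1"]

# Map each candidate to whether it satisfies the guideline.
results = {name: is_short_name(name) for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'ok' if ok else 'too long or contains a space'}")
```

Note that “neighborhood” is 12 characters long, so it would need to be shortened for a program that enforces the 8-character limit.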